Abstract 

This paper presents a new method for a quick similarity-based search through long unlabeled audio streams 
to detect and locate audio clips provided by users. The method involves feature-dimension reduction based on a 
piecewise linear representation of a sequential feature trajectory extracted from a long audio stream. Two techniques 
enable us to obtain a piecewise linear representation: the dynamic segmentation of feature trajectories and the 
segment-based Karhunen-Loeve (KL) transform. The proposed search method guarantees the same search results 
as the search method without the proposed feature-dimension reduction method in principle. Experimental results 
indicate significant improvements in search speed. For example, the proposed method reduced the total search time 
to approximately 1/12 that of previous methods and detected queries in approximately 0.3 seconds from a 200-hour 
audio database. 

Index Terms 

audio retrieval, audio fingerprinting, content identification, feature trajectories, piecewise linear representation, 
dynamic segmentation 

I. Introduction 

This paper presents a method for searching quickly through unlabeled audio signal archives (termed 
stored signals) to detect and locate given audio clips (termed query signals) based on signal similarities. 

Many studies related to audio retrieval have dealt with content-based approaches such as audio content 
classification [1], [2], speech recognition [3], and music transcription [3], [4]. These studies 
mainly focused on associating audio signals with their meanings. In contrast, this study aims at achieving 
a similarity-based search or more specifically fingerprint identification, which constitutes a search of and 
retrieval from unlabeled audio archives based only on a signal similarity measure. That is, our objective is 
signal matching, not the association of signals with their semantics. Although the range of applications for 
a similarity-based search may seem narrow compared with content-based approaches, this is not actually 
the case. The applications include the detection and statistical analysis of broadcast music and commercial 
spots, and the content identification, detection and copyright management of pirated copies of music clips. 
Fig. 1 shows one of the most representative examples of such applications, which has already been 
put to practical use. This system automatically checks and identifies broadcast music clips or commercial 
spots to provide copyright information or other detailed information about the music or the spots. 

Fig. 1. Automatic monitoring system of broadcast content via music content identification. 

In audio fingerprinting applications, the query and stored signals cannot be assumed to be exactly 
the same even in the corresponding sections of the same sound, owing to, for example, compression, 
transmission and irrelevant noises. Meanwhile, for the applications to be practically viable, the features 
should be compact and the feature analysis should be computationally efficient. Several 
feature extraction methods have been developed to attain these objectives. Cano et al. [5] modeled 
music segments as sequences of sound classes estimated via unsupervised clustering and hidden Markov 
models (HMMs). Burges et al. [6] employed several layers of Karhunen-Loeve (KL) transforms, which 
reduced the local statistical redundancy of features with respect to time, and took account of robustness to 
shifting and pitching. Oostveen et al. [7] represented each frame of a video clip as a binary map and used 
the binary map sequence as a feature. This feature is robust to global changes in luminance and contrast 
variations. Haitsma et al. [8] and Kurozumi et al. [9] each employed a similar approach in the context 
of audio fingerprinting. Wang [10] developed a feature-point-based approach to improve the robustness. 
Our previous approach, called the Time-series Active Search (TAS) method [11], introduced a histogram 
as a compact and noise-robust fingerprint, which models the empirical distribution of feature vectors in 
a segment. Histograms are sufficiently robust for monitoring broadcast music or detecting pirated copies. 
Another novelty of this approach is its effectiveness in accelerating the search. Adjacent histograms 
extracted from sliding audio segments are strongly correlated with each other. Therefore, unnecessary 
matching calculations are avoided by exploiting the algebraic properties of histograms. 

Another important research issue regarding similarity-based approaches involves finding a way to speed 
up the search. Multi-dimensional indexing methods [12], [13] have frequently been used for accelerating 
searches. However, when feature vectors are high-dimensional, as they typically are with multimedia 
signals, the efficiency of the existing indexing methods deteriorates significantly [14], [15]. This is why 
search methods based on linear scans such as the TAS method are often employed for searches with 
high-dimensional features. However, methods based solely on linear scans may not be appropriate for 
managing large-scale signal archives, and therefore dimension reduction should be introduced to mitigate 
this effect. 

To this end, this paper presents a quick and accurate audio search method that uses dimensionality 
reduction of histogram features. The method involves a piecewise linear representation of histogram 
sequences by utilizing the continuity and local correlation of the histogram sequences. A piecewise linear 
representation would be feasible for the TAS framework since the histogram sequences form trajectories 
in multi-dimensional spaces. By incorporating our method into the TAS framework, we significantly 
increase the search speed while guaranteeing the same search results as the TAS method. We introduce 
the following two techniques to obtain a piecewise linear representation: the dynamic segmentation of the feature 
trajectories and the segment-based KL transform. 

The segment-based KL transform involves the dimensionality reduction of divided histogram sequences 
(called segments) by KL transform. We take advantage of the continuity and local correlation of feature 
sequences extracted from audio signals. Therefore, we expect to obtain a linear representation with 
few approximation errors and low computational cost. The segment-based KL transform consists of 
three components. The basic component reduces the dimensionality of 
histogram features. The second component, which utilizes residuals between original histogram features and 
features after dimension reduction, greatly reduces the required number of histogram comparisons. Feature 
sampling is introduced as the third component; it not only saves storage space but also contributes 
to accelerating the search. 

Dynamic segmentation refers to the division of histogram sequences into segments of various lengths 
to achieve the greatest possible reduction in the average dimensionality of the histogram features. One of 
the biggest problems in dynamic segmentation is that finding the optimal set of partitions that minimizes 
the average dimensionality requires a substantial calculation. The computational time must be no more 
than that needed for capturing audio signals from the viewpoint of practical applicability. To reduce the 
calculation cost, our technique addresses the quick suboptimal partitioning of the histogram trajectories, 
which consists of local optimization to avoid recursive calculations and the coarse-to-fine detection of 
segment boundaries. 

This paper is organized as follows: Section II introduces the notations and definitions necessary for 
the subsequent explanations. Section III explains the TAS method upon which our method is founded. 
Section IV outlines the proposed search method. Section V discusses a dimensionality reduction technique 
with the segment-based KL transform. Section VI details dynamic segmentation. Section VII presents 
experimental results related to the search speed and shows the advantages of the proposed method. Section 
VIII further discusses the advantages and shortcomings of the proposed method and provides 
additional experimental results. Section IX concludes the paper. 



II. Preliminaries 

Let $\mathcal{N}$ be the set of all non-negative numbers, $\mathcal{R}$ be the set of all real numbers, and $\mathcal{N}^n$ be an $n$-ary 
Cartesian product of $\mathcal{N}$. Vectors are denoted by boldface lower-case letters, e.g. $\mathbf{x}$, and matrices are 
denoted by boldface upper-case letters, e.g. $\mathbf{A}$. The superscript $t$ stands for the transposition of a vector 
or a matrix, e.g. $\mathbf{x}^t$ or $\mathbf{A}^t$. The Euclidean norm of an $n$-dimensional vector $\mathbf{x} = (x_1, \dots, x_n)^t \in \mathcal{R}^n$ is denoted as $\|\mathbf{x}\|$: 

$$\|\mathbf{x}\| \stackrel{\mathrm{def}}{=} \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2},$$

where $|x_i|$ is the magnitude of $x_i$. For any function $f(\cdot)$ and a random variable $X$, $E[f(X)]$ stands for the 
expectation of $f(X)$. Similarly, for a given value $y \in \mathcal{Y}$, some function $g(\cdot,\cdot)$ and a random variable $X$, 
$E[g(X,y) \mid y]$ stands for the conditional expectation of $g(X,y)$ given $y$. 



III. Time-series Active Search 

Fig. 2 outlines the Time-series Active Search (TAS) method, which is the basis of our proposed method. 
We provide a summary of the algorithm here. Details can be found in [11]. 
[Preparation stage] 

1) Base features are extracted from the stored signal. Our preliminary experiments showed that the 
short-time frequency spectrum provides sufficient accuracy for our similarity-based search task. Base 
features are extracted at every sampled time step, for example, every 10 msec. Henceforth, we call 
the sampled points frames (the term was inspired by video frames). Base features are denoted as 
$\mathbf{f}_S(t_S)$ $(0 \le t_S < L_S)$, where $t_S$ represents the position in the stored signal and $L_S$ is the length 
of the stored signal (i.e. the number of frames in the stored signal). 



Fig. 2. Overview of the Time-series Active Search (TAS) method. 



2) Every base feature is quantized by vector quantization (VQ). A codebook $\{\bar{\mathbf{f}}_i\}_{i=1}^{n}$ is created 
beforehand, where $n$ is the codebook size (i.e. the number of codewords in the codebook). We 
utilize the Linde-Buzo-Gray (LBG) algorithm [16] for codebook creation. A quantized base feature 
$q_S(t_S)$ is expressed as a VQ codeword assigned to the corresponding base feature $\mathbf{f}_S(t_S)$, which is 
determined as 

$$q_S(t_S) = \arg\min_{1 \le i \le n} \|\mathbf{f}_S(t_S) - \bar{\mathbf{f}}_i\|^2.$$
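The quantization rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `quantize`, the toy two-dimensional codebook and the 0-based codeword indices are ours (the text indexes codewords from 1 to n), and the codebook is simply given rather than trained with the LBG algorithm.

```python
from typing import List, Sequence


def quantize(feature: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codeword nearest to `feature`.

    Nearest is measured by the squared Euclidean distance, as in the
    arg-min rule of the preparation stage. Indices are 0-based here.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


# Toy data (hypothetical): three codewords and three base features.
codebook: List[List[float]] = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames: List[List[float]] = [[0.1, 0.1], [0.9, 0.2], [0.2, 0.8]]
codes = [quantize(f, codebook) for f in frames]
```

Each frame is thereby replaced by a single integer, which is what makes the later histogram counting cheap.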

[Search stage] 

1) Base features $\mathbf{f}_Q(t_Q)$ $(0 \le t_Q < L_Q)$ of the query signal are extracted in the same way as the 
stored signal and quantized with the codebook $\{\bar{\mathbf{f}}_i\}_{i=1}^{n}$ created in the preparation stage, where $t_Q$ 
represents the position in the query signal and $L_Q$ is its length. We do not have to take into account 
the calculation time for feature quantization since it takes less than 1% of the length of the signal. 
A quantized base feature for the query signal is denoted as $q_Q(t_Q)$. 

2) Histograms are created; one for the stored signal denoted as $\mathbf{x}_S(t_S)$ and the other for the query 
signal denoted as $\mathbf{x}_Q$. First, windows are applied to the sequences of quantized base features 
extracted from the query and stored signals. The window length $W$ (i.e. the number of frames in 
the window) is set at $W = L_Q$, namely the length of the query signal. A histogram is created by 
counting the instances of each VQ codeword over the window. Therefore, each index of a histogram 
bin corresponds to a VQ codeword. We note that a histogram does not take the codeword order into 
account. 

3) Histogram matching is executed based on the distance between histograms, computed as 

$$d(t_S) \stackrel{\mathrm{def}}{=} \|\mathbf{x}_S(t_S) - \mathbf{x}_Q\|.$$

When the distance $d(t_S)$ falls below a given value (search threshold) $\theta$, the query signal is considered 
to be detected at the position $t_S$ of the stored signal. 

4) A window on the stored signal is shifted forward in time and the procedure returns to Step 2). As 
the window for the stored signal shifts forward in time, the VQ codewords included in the window 
cannot change rapidly, which means that the histograms cannot change rapidly either. This implies 
that for a given positive integer $w$ the lower bound on the distance $d(t_S + w)$ is obtained from the 
triangular inequality as follows: 

$$d(t_S + w) \ge \max\{0,\; d(t_S) - \sqrt{2}\,w\},$$

where $\sqrt{2}\,w$ is the maximum distance between $\mathbf{x}_S(t_S)$ and $\mathbf{x}_S(t_S + w)$. Therefore, the skip width 
$w(t_S)$ of the window at the $t_S$-th frame is obtained as 

$$w(t_S) = \begin{cases} \mathrm{floor}\!\left( \dfrac{d(t_S) - \theta}{\sqrt{2}} \right) + 1, & (\text{if } d(t_S) > \theta) \\ 1, & (\text{otherwise}) \end{cases} \tag{1}$$

where floor($a$) indicates the largest integer less than $a$. We note that no sections will ever be missed 
that have distance values smaller than the search threshold $\theta$, even if we skip the width $w(t_S)$ given 
by Eq. (1). 

Fig. 3. Overview of proposed search method. 
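The search stage of the TAS method can be illustrated with a short sketch. It is not the authors' implementation: the function names, the toy codeword sequences and the choice to rebuild each histogram from scratch (rather than updating it incrementally) are our assumptions, made to keep the example small. With floor(a) read as "the largest integer less than a", floor(a) + 1 coincides with the ceiling of a for a > 0, which is what `skip_width` uses.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def histogram(codes: Sequence[int], n: int) -> List[int]:
    """Count how often each of the n VQ codewords occurs in the window."""
    counts = Counter(codes)
    return [counts.get(i, 0) for i in range(n)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two histograms."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def skip_width(d: float, theta: float) -> int:
    """Skip width of the window for a distance d and a threshold theta."""
    if d > theta:
        return max(1, math.ceil((d - theta) / math.sqrt(2)))
    return 1


def tas_search(stored: Sequence[int], query: Sequence[int],
               n: int, theta: float) -> Tuple[List[int], int]:
    """Return detected positions and the number of histogram comparisons."""
    w_len = len(query)                      # window length W = L_Q
    x_q = histogram(query, n)
    hits: List[int] = []
    evaluated = 0
    t = 0
    while t + w_len <= len(stored):
        d = distance(histogram(stored[t:t + w_len], n), x_q)
        evaluated += 1
        if d < theta:                       # distance falls below the threshold
            hits.append(t)
        t += skip_width(d, theta)
    return hits, evaluated


# Toy stored signal (hypothetical): the query occurs once, at frame 20.
stored_codes = [0] * 20 + [1, 2, 1, 2, 3] + [0] * 20
query_codes = [1, 2, 1, 2, 3]
hits, evaluated = tas_search(stored_codes, query_codes, n=4, theta=1.0)
```

In this toy run the skipping visits only a fraction of the window positions, yet returns the same detections as an exhaustive scan, which is the property the skip-width bound is meant to guarantee.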

IV. Framework of proposed search method 

The proposed method improves the TAS method so that the search is accelerated without false dismissals 
(incorrectly missing segments that should be detected) or false detections (identifying incorrect matches). 
To accomplish this, we introduce feature-dimension reduction as explained in Sections V and VI, which 
reduces the calculation costs required for matching. 

Fig. 3 shows an overview of the proposed search method, and Fig. 4 outlines the procedure for feature-dimension 
reduction. The procedure consists of a preparation stage and a search stage. 
[Preparation stage] 

1) Base features $\mathbf{f}_S(t_S)$ are extracted from the stored signal and quantized, to create quantized base 
features $q_S(t_S)$. The procedure is the same as that of the TAS method. 

2) Histograms $\mathbf{x}_S(t_S)$ are created in advance from the quantized base features of the stored signal by 
shifting a window of a predefined length $W$. We note that with the TAS method the window length 
$W$ varies from one search to another, while with the present method the window length $W$ is fixed. 
This is because histograms $\mathbf{x}_S(t_S)$ for the stored signal are created prior to the search. We should 
also note that the TAS method does not create histograms prior to the search because sequences of 
VQ codewords need much less storage space than histogram sequences. 

3) A piecewise linear representation of the extracted histogram sequence is obtained (Fig. 4, block (A)). 
This representation is characterized by a set $T = \{t_j\}_{j=0}^{M}$ of segment boundaries expressed by their 
frame numbers and a set $\{p_j(\cdot)\}_{j=1}^{M}$ of $M$ functions, where $M$ is the number of segments, $t_0 = 0$ 
and $t_M = L_S$. The $j$-th segment is expressed as a half-open interval $[t_{j-1}, t_j)$ since it starts from 
$\mathbf{x}_S(t_{j-1})$ and ends at $\mathbf{x}_S(t_j - 1)$. Section VI shows how to obtain such segment boundaries. Each 
function $p_j(\cdot) : \mathcal{N}^n \to \mathcal{R}^{m_j}$ that corresponds to the $j$-th segment reduces the dimensionality $n$ of 
the histogram to the dimensionality $m_j$. Section V-B shows how to determine these functions. 

4) The histograms $\mathbf{x}_S(t_S)$ are compressed by using the functions $\{p_j(\cdot)\}_{j=1}^{M}$ obtained in the previous 
step, and then compressed features $\mathbf{y}_S(t_S)$ are created (Fig. 4, block (B)). Section V-C details how 
to create compressed features. 

5) The compressed features $\mathbf{y}_S(t_S)$ are sampled at regular intervals (Fig. 4, block (C)). The details are 
presented in Section V-D. 

Fig. 4. Overview of the procedure for obtaining compressed features. 
[Search stage] 



1) Base features $\mathbf{f}_Q(t_Q)$ are extracted and a histogram $\mathbf{x}_Q$ is created from the query signal in the same 
way as the TAS method. 

2) The histogram $\mathbf{x}_Q$ is compressed based on the functions $\{p_j(\cdot)\}_{j=1}^{M}$ obtained in the preparation stage, 
to create $M$ compressed features $\mathbf{y}_Q[j]$ $(j = 1, \dots, M)$. Each compressed feature $\mathbf{y}_Q[j]$ corresponds 
to the $j$-th function $p_j(\cdot)$. The procedure used to create compressed features is the same as that for 
the stored signal. 

3) Compressed features created from the stored and query signals are matched, that is, the distance 
$\|\mathbf{y}_S(t_S) - \mathbf{y}_Q[j_{t_S}]\|$ between two compressed features $\mathbf{y}_S(t_S)$ and $\mathbf{y}_Q[j_{t_S}]$ is calculated, 
where $j_{t_S}$ represents the index of the segment that contains $\mathbf{x}_S(t_S)$, namely $t_{j_{t_S}-1} \le t_S < t_{j_{t_S}}$. 

4) If the distance falls below the search threshold $\theta$, the original histograms $\mathbf{x}_S(t_S)$ corresponding to 
the surviving compressed features $\mathbf{y}_S(t_S)$ are verified. Namely, the distance $d(t_S) = \|\mathbf{x}_S(t_S) - \mathbf{x}_Q\|$ 
is calculated and compared with the search threshold $\theta$. 

5) A window on the stored signal is shifted forward in time and the procedure goes back to Step 3). 
The skip width of the window is calculated from the distance between compressed features. 
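The screening-then-verification structure of Steps 3) and 4) can be sketched as follows. This is a deliberately reduced illustration, not the proposed method itself: window skipping and per-segment query features are omitted, and the "compressed feature" is simply the first histogram coordinate, which we use only because dropping coordinates can never increase a Euclidean distance. All names and data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def two_stage_search(stored_x: Sequence[Sequence[float]],
                     stored_y: Sequence[Sequence[float]],
                     query_x: Sequence[float],
                     query_y: Sequence[float],
                     theta: float) -> Tuple[List[int], int]:
    """Screen with low-dimensional features, then verify with histograms.

    Assumes dist(stored_y[t], query_y) never exceeds
    dist(stored_x[t], query_x), so screening causes no false dismissals.
    Returns the detected positions and the number of verifications.
    """
    hits: List[int] = []
    verified = 0
    for t, (x_s, y_s) in enumerate(zip(stored_x, stored_y)):
        if dist(y_s, query_y) < theta:          # cheap screening
            verified += 1
            if dist(x_s, query_x) < theta:      # exact verification
                hits.append(t)
    return hits, verified


# Toy data (hypothetical): keep only the first coordinate as the cheap feature.
stored_x = [[0.0, 0.0, 0.0], [1.0, 0.0, 5.0], [1.0, 1.0, 0.0], [5.0, 5.0, 5.0]]
stored_y = [x[:1] for x in stored_x]
query_x = [1.0, 1.0, 0.5]
query_y = query_x[:1]
hits, verified = two_stage_search(stored_x, stored_y, query_x, query_y, theta=1.0)
```

Only the candidates that survive screening pay the cost of a full-dimensional comparison, and the verification step removes any false detections introduced by the looser bound.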



Fig. 5. Intuitive illustration of piecewise linear representation. 



V. Dimension reduction based on piecewise linear representation 
A. Related work 

In most practical similarity-based searches, we cannot expect the features to be globally correlated, and 
therefore there is little hope of reducing dimensionality over entire feature spaces. However, even when 
there is no global correlation, feature subsets may exist that are locally correlated. Such local correlation 
of feature subsets has the potential to further reduce feature dimensionality. 

A large number of dimensionality reduction methods have been proposed that focused on local correlation 
(e.g. [17], [18], [19], [20]). Many of these methods do not assume any specific characteristics. 
Here, we concentrate on the dimensionality reduction of time-series signals, and therefore we take 
advantage of their continuity and local correlation. The computational cost for obtaining such feature 
subsets is expected to be very small compared with that of existing methods that do not utilize the 
continuity and local correlation of time-series signals. 

Dimensionality reduction methods for time-series signals are categorized into two types: temporal 
dimensionality reduction, namely dimensionality reduction along the temporal axis (e.g. feature sampling), 
and spatial dimensionality reduction, namely the dimensionality reduction of each multi-dimensional 
feature sample. Keogh et al. [21], [22] and Wang et al. [23] have introduced temporal dimensionality 
reduction into waveform signal retrieval. Their framework considers the waveform itself as a feature for 
detecting similar signal segments. That is why they mainly focused on temporal dimensionality reduction. 
When considering audio fingerprinting, however, we handle sequences of high-dimensional features that 
are necessary to identify various kinds of audio segments. Thus, both spatial and temporal dimensionality 
reduction are required. To this end, our method mainly focuses on spatial dimensionality reduction. We 
also incorporate a temporal dimensionality reduction technique inspired by the method of Keogh et al. 
[22], which is described in Section V-D. 



B. Segment-based KL transform 

Fig. 5 shows an intuitive example of a piecewise linear representation. Since the histograms are created 
by shifting the window forward in time, successive histograms cannot change rapidly. Therefore, the 
histogram sequence forms a smooth trajectory in an $n$-dimensional space even if a stored audio signal 
includes distinct non-sequential patterns, such as irregular drum beats and intervals between music clips. 
This implies that a piecewise lower-dimensional representation is feasible for such a sequential histogram 
trajectory. 

As the first step towards obtaining a piecewise representation, the histogram sequence is divided into 
$M$ segments. Dynamic segmentation is introduced here, which enhances feature-dimension reduction 
performance. This will be explained in detail in Section VI. Second, a KL transform is performed for 
every segment and a minimum number of eigenvectors are selected such that the sum of their contribution 
rates exceeds a predefined value $\sigma$, where the contribution rate of an eigenvector stands for its eigenvalue 
divided by the sum of all eigenvalues, and the predefined value $\sigma$ is called the contribution threshold. The 
number of selected eigenvectors in the $j$-th segment is written as $m_j$. Then, a function $p_j(\cdot) : \mathcal{N}^n \to \mathcal{R}^{m_j}$ 
$(j = 1, 2, \dots, M)$ for dimensionality reduction is determined as a map to a subspace whose bases are 
the selected eigenvectors: 

$$p_j(\mathbf{x}) = \mathbf{P}_j^t (\mathbf{x} - \bar{\mathbf{x}}_j), \tag{2}$$

where $\mathbf{x}$ is a histogram, $\bar{\mathbf{x}}_j$ is the centroid of histograms contained in the $j$-th segment, and $\mathbf{P}_j$ is an 
$(n \times m_j)$ matrix whose columns are the selected eigenvectors. Finally, each histogram is compressed by 
using the function $p_j(\cdot)$ of the segment to which the histogram belongs. Henceforth, we refer to $p_j(\mathbf{x})$ as 
a projected feature of a histogram $\mathbf{x}$. 

In the following, we omit the index $j$ corresponding to a segment unless it is specifically needed, e.g. 
$p(\mathbf{x})$ and $\bar{\mathbf{x}}$. 
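Two ingredients of the segment-based KL transform lend themselves to a small sketch: choosing the number of eigenvectors from the contribution threshold, and applying the projection of Eq. (2). The sketch assumes that the eigenvalues and orthonormal eigenvectors of a segment have already been computed (the eigen-decomposition itself is outside its scope); the function names and the toy values are ours.

```python
from typing import List, Sequence


def select_dimension(eigenvalues: Sequence[float], sigma: float) -> int:
    """Smallest m whose top-m contribution rates sum to more than sigma.

    The contribution rate of an eigenvector is its eigenvalue divided by
    the sum of all eigenvalues. Assumes non-negative eigenvalues.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0.0:
        return 0                      # degenerate segment: no variance at all
    cumulative = 0.0
    for m, lam in enumerate(ordered, start=1):
        cumulative += lam / total
        if cumulative > sigma:
            return m
    return len(ordered)


def project(x: Sequence[float], centroid: Sequence[float],
            basis: Sequence[Sequence[float]]) -> List[float]:
    """Projected feature P^t (x - centroid).

    `basis` lists the selected eigenvectors, i.e. the columns of P.
    """
    centered = [xi - ci for xi, ci in zip(x, centroid)]
    return [sum(e_i * c_i for e_i, c_i in zip(e, centered)) for e in basis]


# Toy values (hypothetical): contribution rates are 0.5, 0.3, 0.15 and 0.05.
m_low = select_dimension([5.0, 3.0, 1.5, 0.5], sigma=0.75)
m_high = select_dimension([5.0, 3.0, 1.5, 0.5], sigma=0.9)
z = project([4.0, 5.0, 1.0], centroid=[1.0, 1.0, 1.0], basis=[[1.0, 0.0, 0.0]])
```

A larger contribution threshold keeps more eigenvectors, so it trades a higher dimensionality for a tighter approximation of the segment.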



C. Distance bounding 

From the nature of the KL transform, the distance between two projected features gives the lower bound 
of the distance between corresponding original histograms. However, this bound does not approximate 
the original distance well, and this results in many false detections. 

To improve the distance bound, we introduce a new technique. Let us define a projection distance 
$\delta(p, \mathbf{x})$ as the distance between a histogram $\mathbf{x}$ and the corresponding projected feature $\mathbf{z} = p(\mathbf{x})$: 

$$\delta(p, \mathbf{x}) \stackrel{\mathrm{def}}{=} \|\mathbf{x} - q(\mathbf{z})\|, \tag{3}$$

where $q(\cdot) : \mathcal{R}^m \to \mathcal{R}^n$ is the generalized inverse map of $p(\cdot)$, defined as 

$$q(\mathbf{z}) \stackrel{\mathrm{def}}{=} \mathbf{P}\mathbf{z} + \bar{\mathbf{x}}.$$

Here we create a compressed feature $\mathbf{y}$, which is the projected feature $\mathbf{z} = (z_1, z_2, \dots, z_m)^t$ along with 
the projection distance $\delta(p, \mathbf{x})$: 

$$\mathbf{y} = \mathbf{y}(p, \mathbf{x}) = (z_1, z_2, \dots, z_m, \delta(p, \mathbf{x}))^t,$$

where $\mathbf{y}(p, \mathbf{x})$ means that $\mathbf{y}$ is determined by $p$ and $\mathbf{x}$. The Euclidean distance between compressed 
features is utilized as a new criterion for matching instead of the Euclidean distance between projected 
features. The distance is expressed as 

$$\|\mathbf{y}_S - \mathbf{y}_Q\|^2 = \|\mathbf{z}_S - \mathbf{z}_Q\|^2 + \{\delta(p, \mathbf{x}_S) - \delta(p, \mathbf{x}_Q)\}^2, \tag{4}$$

where $\mathbf{z}_S = p(\mathbf{x}_S)$ (resp. $\mathbf{z}_Q = p(\mathbf{x}_Q)$) is the projected feature derived from the original histogram $\mathbf{x}_S$ 
(resp. $\mathbf{x}_Q$) and $\mathbf{y}_S = \mathbf{y}_S(p, \mathbf{x}_S)$ (resp. $\mathbf{y}_Q = \mathbf{y}_Q(p, \mathbf{x}_Q)$) is the corresponding compressed feature. Eq. (4) 
implies that the distance between compressed features is no smaller than the distance between corresponding 
projected features. In addition, from the above discussions, we have the following two properties, which 
indicate that the distance $\|\mathbf{y}_S - \mathbf{y}_Q\|$ between two compressed features is a better approximation of the 
distance $\|\mathbf{x}_S - \mathbf{x}_Q\|$ between the original histograms than the distance $\|\mathbf{z}_S - \mathbf{z}_Q\|$ between projected 
features (Theorem 1), and the expected approximation error is much smaller (Theorem 2). 
Theorem 1: 

$$\|\mathbf{z}_S - \mathbf{z}_Q\| \le \|\mathbf{y}_S - \mathbf{y}_Q\| = \min_{(\tilde{\mathbf{x}}_S, \tilde{\mathbf{x}}_Q) \in A(\mathbf{y}_S, \mathbf{y}_Q)} \|\tilde{\mathbf{x}}_S - \tilde{\mathbf{x}}_Q\| \le \|\mathbf{x}_S - \mathbf{x}_Q\|, \tag{5}$$

where $A(\mathbf{y}_S, \mathbf{y}_Q)$ is the set of all possible pairs $(\tilde{\mathbf{x}}_S, \tilde{\mathbf{x}}_Q)$ of original histograms for given compressed 
features $(\mathbf{y}_S, \mathbf{y}_Q)$. 

Theorem 2: Suppose that random variables $(\mathbf{X}_S, \mathbf{X}_Q)$ corresponding to the original histograms $(\mathbf{x}_S, \mathbf{x}_Q)$ 
have a uniform distribution on the set $A(\mathbf{y}_S, \mathbf{y}_Q)$ defined in Theorem 1 and $E[\delta(p, \mathbf{X}_S)] \gg E[\delta(p, \mathbf{X}_Q)]$. 
The expected approximation errors can be evaluated as 

$$E\left[ \|\mathbf{X}_S - \mathbf{X}_Q\|^2 - \|\mathbf{y}_S - \mathbf{y}_Q\|^2 \,\middle|\, \mathbf{y}_S, \mathbf{y}_Q \right] \ll E\left[ \|\mathbf{X}_S - \mathbf{X}_Q\|^2 - \|\mathbf{z}_S - \mathbf{z}_Q\|^2 \,\middle|\, \mathbf{y}_S, \mathbf{y}_Q \right]. \tag{6}$$

The proofs are shown in the appendix. Fig. 6 shows an intuitive illustration of the relationships between 
projection distances, distances between projected features and distances between compressed features, 
where the histograms are in a 3-dimensional space and the subspace dimensionality is 1. In this case, for 
given compressed features $(\mathbf{y}_S, \mathbf{y}_Q)$ and a fixed query histogram $\mathbf{x}_Q$, a stored histogram $\mathbf{x}_S$ must be on 
a circle whose center is $q(\mathbf{z}_Q)$. This circle corresponds to the set $A(\mathbf{y}_S, \mathbf{y}_Q)$. 

Fig. 6. Intuitive illustration of relationships between projection distance, distance between projected features and distance between 
compressed features. 
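The ordering of the three distances can be checked numerically on a toy case similar to the 3-dimensional setting with a 1-dimensional subspace. The sketch below is only an illustration: the helper names, the centroid and the single basis vector are assumptions chosen so that the numbers are easy to follow, and the basis is taken to be orthonormal.

```python
import math
from typing import List, Sequence


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def project(x: Sequence[float], centroid: Sequence[float],
            basis: Sequence[Sequence[float]]) -> List[float]:
    """Projected feature z = P^t (x - centroid)."""
    centered = [xi - ci for xi, ci in zip(x, centroid)]
    return [sum(e_i * c_i for e_i, c_i in zip(e, centered)) for e in basis]


def reconstruct(z: Sequence[float], centroid: Sequence[float],
                basis: Sequence[Sequence[float]]) -> List[float]:
    """Generalized inverse map q(z) = P z + centroid."""
    out = list(centroid)
    for coeff, e in zip(z, basis):
        for i, e_i in enumerate(e):
            out[i] += coeff * e_i
    return out


def compressed_feature(x: Sequence[float], centroid: Sequence[float],
                       basis: Sequence[Sequence[float]]) -> List[float]:
    """Projected feature followed by the projection distance."""
    z = project(x, centroid, basis)
    delta = dist(x, reconstruct(z, centroid, basis))
    return z + [delta]


# Toy setting (hypothetical): 3-dimensional histograms, 1-dimensional subspace.
centroid = [1.0, 1.0, 1.0]
basis = [[1.0, 0.0, 0.0]]
x_s, x_q = [4.0, 5.0, 1.0], [2.0, 1.0, 2.0]

d_z = dist(project(x_s, centroid, basis), project(x_q, centroid, basis))
d_y = dist(compressed_feature(x_s, centroid, basis),
           compressed_feature(x_q, centroid, basis))
d_x = dist(x_s, x_q)
```

Here the compressed-feature distance lies between the projected-feature distance and the original distance, so it is a tighter lower bound while still costing only one extra coordinate per feature.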

D. Feature sampling 

In the TAS method, quantized base features are stored, because they need much less storage space 
than the histogram sequence and creating histograms on the spot takes little calculation. With the present 
method, however, compressed features must be computed and stored in advance so that the search results 
can be returned as quickly as possible, and therefore much more storage space is needed than with the 
TAS method. The increase in storage space may cause a reduction in search speed due to the increase in 
disk access. 

Based on the above discussion, we incorporate feature sampling in the temporal domain. The following 
idea is inspired by the technique called Piecewise Aggregate Approximation (PAA) [22]. With the 
proposed feature sampling method, first a compressed feature sequence $\{\mathbf{y}_S(t_S)\}_{t_S=0}^{L_S-1}$ is divided into 
subsequences 

$$\{\mathbf{y}_S(ia), \mathbf{y}_S(ia + 1), \dots, \mathbf{y}_S(ia + a - 1)\}_{i=0,1,\dots}$$

of length $a$. Then, the first compressed feature $\mathbf{y}_S(ia)$ of every subsequence is selected as a representative 
feature. A lower bound of the distances between the query and stored compressed features contained in 
the subsequence can be expressed in terms of the representative feature $\mathbf{y}_S(ia)$. This bound is obtained 
from the triangular inequality as follows: 

$$\|\mathbf{y}_S(ia + k) - \mathbf{y}_Q\| \ge \|\mathbf{y}_S(ia) - \mathbf{y}_Q\| - d(i),$$

$$d(i) \stackrel{\mathrm{def}}{=} \max_{0 \le k' \le a-1} \|\mathbf{y}_S(ia + k') - \mathbf{y}_S(ia)\|$$

$$(\forall i = 0, 1, \dots, \quad \forall k = 0, \dots, a - 1).$$

This implies that preserving the representative feature $\mathbf{y}_S(ia)$ and the maximum distance $d(i)$ is sufficient 
to guarantee that there are no false dismissals. 

This feature sampling is feasible for histogram sequences because successive histograms cannot change 
rapidly. Furthermore, the technique mentioned in this section will also contribute to accelerating the search, 
especially when successive histograms change little. 
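A minimal sketch of the sampling idea follows, assuming a single query feature for all subsequences (in the method described here the query feature depends on the segment) and plain lists as features. The names `sample_features` and `candidate_blocks` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def sample_features(features: Sequence[Sequence[float]],
                    a: int) -> List[Tuple[Sequence[float], float]]:
    """Keep, for every subsequence of length a, its first feature and the
    maximum distance from that feature to any member of the subsequence."""
    sampled = []
    for start in range(0, len(features), a):
        block = features[start:start + a]
        rep = block[0]
        radius = max(dist(f, rep) for f in block)
        sampled.append((rep, radius))
    return sampled


def candidate_blocks(sampled: Sequence[Tuple[Sequence[float], float]],
                     y_q: Sequence[float], theta: float) -> List[int]:
    """Indices of subsequences whose lower bound is below the threshold.

    A subsequence is discarded only when dist(rep, y_q) - radius >= theta,
    in which case no member can lie within theta of the query.
    """
    return [i for i, (rep, radius) in enumerate(sampled)
            if dist(rep, y_q) - radius < theta]


# Toy trajectory (hypothetical): a slowly moving 2-dimensional feature.
features = [[float(t), 0.0] for t in range(12)]
a = 4
sampled = sample_features(features, a)
candidates = candidate_blocks(sampled, y_q=[9.0, 0.0], theta=1.5)
```

Only the subsequences listed in `candidates` need to be examined in detail; the others are skipped as a whole, which is where the additional speed-up comes from when successive features change little.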

VI. Dynamic segmentation 

A. Related work 

The approach used for dividing histogram sequences into segments is critical for realizing efficient 
feature-dimension reduction since the KL transform is most effective when the constituent elements in 
the histogram segments are similar. To achieve this, we introduce a dynamic segmentation strategy. 

Dynamic segmentation is a generic term that refers to techniques for dividing sequences into segments 
of various lengths. Dynamic segmentation methods for time-series signals have already been applied to 
various kinds of applications such as speech coding (e.g. [24]), the temporal compression of waveform 
signals [25], the automatic segmentation of speech signals into phonic units [26], sinusoidal modeling 
of audio signals [27], [28], [29] and motion segmentation in video signals [30]. We employ dynamic 
segmentation to minimize the average dimensionality of high-dimensional feature trajectories. 

Dynamic segmentation can improve dimension reduction performance. However, finding the optimal 
boundaries still requires a substantial calculation. With this in mind, several studies have adopted suboptimal 
approaches, such as longest line fitting [23], wavelet decomposition [23], [21] and the bottom-up 
merging of segments [31]. The first two approaches still incur a substantial calculation cost for long 
time-series signals. The last approach is promising as regards obtaining a rough global approximation at 
a practical calculation cost. This method is compatible with ours; however, we mainly focus on a more 
precise local optimization. 

B. Framework 

Fig. 7 shows an outline of our dynamic segmentation method. The objective of the dynamic segmentation 
method is to divide the stored histogram sequence so that its piecewise linear representation is well 
characterized by a set of lower dimensional subspaces. To this end, we formulate the dynamic segmentation 
as a way to find a set $T^* = \{t_j^*\}_{j=0}^{M}$ of segment boundaries that minimize the average dimensionality of 
these segment-approximating subspaces on condition that the boundary $t_j^*$ between the $j$-th and the $(j+1)$-th 
segments is in a shiftable range $S_j$, which is defined as a section with a width $\Delta$ in the vicinity of the 
initial position $t_j^0$ of the boundary between the $j$-th and the $(j+1)$-th segments. Namely, the set $T^*$ of 
the optimal segment boundaries is given by the following formula: 

$$T^* = \{t_j^*\}_{j=0}^{M} \stackrel{\mathrm{def}}{=} \arg\min_{\{t_j\}_{j=0}^{M}:\ t_j \in S_j} \frac{1}{L_S} \sum_{j=1}^{M} (t_j - t_{j-1}) \cdot c(t_{j-1}, t_j, \sigma), \tag{7}$$

$$S_j \stackrel{\mathrm{def}}{=} \{t_j : t_j^0 - \Delta \le t_j \le t_j^0 + \Delta\}, \tag{8}$$

where $c(t_i, t_j, \sigma)$ represents the subspace dimensionality on the segment between the $t_i$-th and the $t_j$-th 
frames for a given contribution threshold $\sigma$, $t_0^* = 0$ and $t_M^* = L_S$. The initial positions of the segment 
boundaries are set beforehand by equi-partitioning. 
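The quantity being minimized can be made concrete with a small sketch. It assumes that the objective is the length-weighted average of the per-segment subspace dimensionalities, and it replaces the KL-transform-based dimensionality with a hypothetical stand-in function `toy_c` (with the contribution threshold folded into it) so that the example stays self-contained.

```python
from typing import Callable, Sequence


def average_dimensionality(boundaries: Sequence[int],
                           c: Callable[[int, int], int]) -> float:
    """Length-weighted average of the segment dimensionalities.

    `boundaries` is the increasing list t_0, t_1, ..., t_M and `c(a, b)`
    returns the subspace dimensionality of the segment [a, b).
    """
    total_length = boundaries[-1] - boundaries[0]
    weighted = sum((b - a) * c(a, b) for a, b in zip(boundaries, boundaries[1:]))
    return weighted / total_length


def toy_c(a: int, b: int) -> int:
    """Hypothetical dimensionality: the trajectory changes character at
    frame 13, so a segment straddling that frame needs more dimensions."""
    return 1 if b <= 13 or a >= 13 else 3


equal_split = average_dimensionality([0, 10, 20], toy_c)
shifted_split = average_dimensionality([0, 13, 20], toy_c)
```

Moving the boundary from the equi-partitioned position to the frame where the trajectory changes lowers the average dimensionality, which is the effect dynamic segmentation aims for.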

The above optimization problem defined by Eq. (7) would normally be solved with dynamic programming 
(DP) (e.g. [32]). However, DP is not practical in this case. Deriving $c(t_{j-1}, t_j, \sigma)$ included in Eq. (7) 
incurs a substantial calculation cost since it is equivalent to executing a KL transform calculation for the 
segment $[t_{j-1}, t_j)$. This implies that the DP-based approach requires a significant amount of calculation, 
although less than a naive approach. The above discussion implies that we should reduce the number of 
KL transform calculations to reduce the total calculation cost required for the optimization. When we 
adopt the total number of KL transform calculations as a measure for assessing the calculation cost, the 
cost is evaluated as $O(M\Delta^2)$, where $M$ is the number of segments and $\Delta$ is the width of the shiftable 
range. 

To reduce the calculation cost, we instead adopt a suboptimal approach. Two techniques are incorporated: 
local optimization and the coarse-to-fine detection of segment boundaries. We explain these two techniques 
in the following sections. 



C. Local optimization 

The local optimization technique modifies the formulation (Eq. (7)) of dynamic segmentation so that it 
minimizes the average dimensionality of the subspaces of adjoining segments. The basic idea is similar to 
the "forward segmentation" technique introduced by Goodwin [27], [28] for deriving accurate sinusoidal 
models of audio signals. The position $t_j^*$ of the boundary is determined by using the following forward 
recursion as a substitute for Eq. (7): 

$$t_j^* = \arg\min_{t_j \in S_j} \frac{(t_j - t_{j-1}^*)\, c_j^* + (t_{j+1}^0 - t_j)\, c_{j+1}^0}{t_{j+1}^0 - t_{j-1}^*}, \tag{9}$$

where 

$$c_j^* = c(t_{j-1}^*, t_j, \sigma), \qquad c_{j+1}^0 = c(t_j, t_{j+1}^0, \sigma),$$

and $S_j$ is defined in Eq. (8). As can be seen in Eq. (9), we can determine each segment boundary 
independently, unlike the formulation of Eq. (7). Therefore, the local optimization technique can reduce 
the amount of calculation needed for extracting an appropriate representation, which is evaluated as 
$O(M\Delta)$, where $M$ is the number of segments and $\Delta$ is the width of the shiftable range. 
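One reading of the local optimization can be sketched as an exhaustive scan of the shiftable range for a single boundary. The sketch assumes that the local objective is the length-weighted average dimensionality of the two adjoining segments, with the previous boundary already fixed and the next boundary still at its initial position; `toy_c` is a hypothetical stand-in for the KL-transform-based dimensionality.

```python
from typing import Callable, Tuple


def local_boundary(t_prev: int, t_init: int, t_next_init: int, delta: int,
                   c: Callable[[int, int], int]) -> Tuple[int, float]:
    """Choose one boundary inside [t_init - delta, t_init + delta].

    t_prev      : already optimized boundary of the preceding segment
    t_next_init : initial (not yet optimized) position of the next boundary
    Returns the best position and its average dimensionality.
    """
    lo = max(t_init - delta, t_prev + 1)        # keep both segments non-empty
    hi = min(t_init + delta, t_next_init - 1)
    best_t, best_cost = lo, float("inf")
    for t in range(lo, hi + 1):
        cost = ((t - t_prev) * c(t_prev, t)
                + (t_next_init - t) * c(t, t_next_init)) / (t_next_init - t_prev)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


def toy_c(a: int, b: int) -> int:
    """Hypothetical dimensionality with a change of character at frame 13."""
    return 1 if b <= 13 or a >= 13 else 3


best_t, best_cost = local_boundary(t_prev=0, t_init=10, t_next_init=20,
                                   delta=5, c=toy_c)
```

Each boundary needs a number of dimensionality evaluations proportional to the width of its shiftable range, and boundaries are handled one after another, which matches the linear dependence on the number of segments and on the range width stated above.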



D. Coarse-to-fine detection 

Fig. 9. Example 2: $c(t_{j-1}^*, t_j, \sigma)$ increases when the boundary $t_j$ is shifted forward in time. 

The coarse-to-fine detection technique selects suboptimal boundaries in the sense of Eq. (9) with less 
computational cost. We note that small boundary shifts do not contribute greatly to changes in segment 
dimensionality because successive histograms cannot change rapidly. With this in mind, we assume that 
the optimal positions of the segment boundaries are at the edges of the shiftable range or at the points 
where dimensions change. Figs. 8 and 9 show two intuitive examples where the optimal position of 
the segment boundary may be at the point where dimensionality changes. The coarse-to-fine detection 
technique quickly finds the points where the dimensions change. The procedure for this technique has 
three steps. 

1) The dimensions of the $j$-th and $(j+1)$-th segments are calculated when the segment boundary $t_j$ 
is at the initial position $t_j^0$ and the edges ($t_j^0 - \Delta$ and $t_j^0 + \Delta$) of its shiftable range. 

2) The dimensions of the $j$-th and $(j+1)$-th segments are calculated when the segment boundary $t_j$ 
is at $u_j$ equally spaced positions counted from the lower edge $t_j^0 - \Delta$ (indexed by 
$i = 1, 2, \ldots, u_j$), where $u_j$ determines the number of calculations in this step. 

3) The dimensions of the $j$-th and $(j+1)$-th segments are calculated in detail when the segment 
boundary $t_j$ is in the positions where dimension changes are detected in the previous step. 
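
The three steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `coarse_to_fine`, the callable `dim` (standing in for $c(\cdot, \cdot, \sigma)$), the even spacing of $2\Delta/(u+1)$ between the interior positions, and the toy sequence are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple


def coarse_to_fine(
    left: int, t0: int, right: int, delta: int, u: int,
    dim: Callable[[int, int], int],
) -> Tuple[int, int]:
    """Pick a boundary in [t0 - delta, t0 + delta].

    Returns (chosen boundary, number of boundary positions evaluated).
    left, right: fixed neighbouring boundaries t*_{j-1} and t^0_{j+1}.
    dim(a, b): hypothetical stand-in for c(a, b, sigma).
    """
    cache: Dict[int, Tuple[int, int]] = {}

    def dims(t: int) -> Tuple[int, int]:
        # Dimensionalities of the j-th and (j+1)-th segments for boundary t.
        if t not in cache:
            cache[t] = (dim(left, t), dim(t, right))
        return cache[t]

    lo, hi = t0 - delta, t0 + delta
    # Steps 1 and 2: both edges, the initial position, and u interior points
    # (the even spacing used here is an assumption of this sketch).
    coarse = sorted({lo, t0, hi} | {lo + (2 * delta * i) // (u + 1) for i in range(1, u + 1)})
    for t in coarse:
        dims(t)
    # Step 3: scan in detail only the coarse intervals whose end points disagree.
    for a, b in zip(coarse, coarse[1:]):
        if dims(a) != dims(b):
            for t in range(a + 1, b):
                dims(t)

    def avg_dim(t: int) -> float:
        c_left, c_right = cache[t]
        return ((t - left) * c_left + (right - t) * c_right) / (right - left)

    return min(sorted(cache), key=avg_dim), len(cache)


seq = [0] * 7 + [1] * 13 + [2] * 10          # toy symbol sequence
toy_dim = lambda a, b: len(set(seq[a:b]))    # toy dimensionality: distinct symbols
print(coarse_to_fine(0, 10, 20, 4, 2, toy_dim))  # -> (7, 6)
```

In this toy run, six boundary positions are evaluated instead of the nine that an exhaustive scan of the shiftable range would need, and the same boundary is found. A change that reverts within a single coarse interval would be missed, which is why the selected boundaries are only suboptimal.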

We determine the number $u_j$ of dimension calculations in step 2 so that the number of calculations in all 
the above steps, $f_j(u_j)$, is minimized. Then, $f_j(u_j)$ is given as follows: 

$$f_j(u_j) = 2\left((3 + u_j) + K_j \frac{2\Delta}{u_j + 2}\right),$$

where $K_j$ is the estimated number of positions where the dimensionalities change, which is experimentally 
determined as 

$$K_j = \begin{cases}
c_{LR} - c_{LL} & (\text{if } c_{LR} < c_{RR},\ c_{LL} < c_{RL}) \\
(c_{LC} - c_{LL}) + \min(c_{RC}, c_{LR}) - \min(c_{LC}, c_{RR}) & (\text{if } c_{LR} > c_{RR},\ c_{LL} < c_{RL},\ c_{LC} < c_{RC}) \\
(c_{RC} - c_{RR}) + \min(c_{LC}, c_{RL}) - \min(c_{RC}, c_{LL}) & (\text{if } c_{LR} > c_{RR},\ c_{LL} < c_{RL},\ c_{LC} > c_{RC}) \\
c_{RL} - c_{RR} & (\text{otherwise})
\end{cases}$$

and 

$$\begin{aligned}
c_{LL} &= c(t_{j-1}^*, t_j^0 - \Delta, \sigma), & c_{RL} &= c(t_j^0 - \Delta, t_{j+1}^0, \sigma), \\
c_{LC} &= c(t_{j-1}^*, t_j^0, \sigma), & c_{RC} &= c(t_j^0, t_{j+1}^0, \sigma), \\
c_{LR} &= c(t_{j-1}^*, t_j^0 + \Delta, \sigma), & c_{RR} &= c(t_j^0 + \Delta, t_{j+1}^0, \sigma).
\end{aligned}$$
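
As an illustration, the case analysis for $K_j$ can be transcribed directly into a small function. The function name `estimate_K` and the argument names are hypothetical; the strictness of the inequalities follows the conditions as given here, so ties fall through to the last case.

```python
def estimate_K(c_LL: int, c_LC: int, c_LR: int,
               c_RL: int, c_RC: int, c_RR: int) -> int:
    """Estimated number of positions where the dimensionalities change.

    The first letter of each subscript names the segment (L: j-th, R: (j+1)-th),
    the second the boundary position (L: lower edge, C: initial, R: upper edge).
    """
    if c_LR < c_RR and c_LL < c_RL:
        return c_LR - c_LL
    if c_LR > c_RR and c_LL < c_RL and c_LC < c_RC:
        return (c_LC - c_LL) + min(c_RC, c_LR) - min(c_LC, c_RR)
    if c_LR > c_RR and c_LL < c_RL and c_LC > c_RC:
        return (c_RC - c_RR) + min(c_LC, c_RL) - min(c_RC, c_LL)
    return c_RL - c_RR


# First case: the left segment stays lower-dimensional over the whole range.
print(estimate_K(c_LL=3, c_LC=4, c_LR=5, c_RL=8, c_RC=7, c_RR=6))  # -> 2
```

Only the six dimension values already computed in step 1 are needed, so estimating $K_j$ adds no further dimension calculations.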



The first term of $f_j(u_j)$ refers to the number of calculations in steps 1 and 2, and the second term 
corresponds to that in step 3. $f_j(u_j)$ takes the minimum value $4\sqrt{2K_j\Delta} + 2$ when 
$u_j = \sqrt{2K_j\Delta} - 2$. The calculation cost when incorporating local optimization and 
coarse-to-fine detection techniques is evaluated as follows: 

$$E\left[M\left(4\sqrt{2K_j\Delta} + 2\right)\right] \le M\left(4\sqrt{2K\Delta} + 2\right) = O\left(M\sqrt{K\Delta}\right),$$

where $K = E[K_j]$, $M$ is the number of segments and $\Delta$ is the width of the shiftable range. The first 
inequality is derived from Jensen's inequality (e.g. [33, Theorem 2.6.2]). The coarse-to-fine detection 
technique can additionally reduce the calculation cost because $K$ is usually much smaller than $\Delta$. 
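
The stated optimum can be checked with a short calculation. The following is a sketch that assumes $f_j(u_j) = 2\left((3 + u_j) + K_j \frac{2\Delta}{u_j + 2}\right)$, treats $u_j$ as a continuous variable, and introduces the auxiliary symbol $v = u_j + 2$ purely for convenience.

```latex
\begin{align*}
f_j(u_j) &= 2(3 + u_j) + \frac{4 K_j \Delta}{u_j + 2}
          = 2v + 2 + \frac{4 K_j \Delta}{v}, \qquad v = u_j + 2, \\
\frac{d f_j}{d v} &= 2 - \frac{4 K_j \Delta}{v^2} = 0
          \;\Longrightarrow\; v = \sqrt{2 K_j \Delta}, \quad u_j = \sqrt{2 K_j \Delta} - 2, \\
f_j\bigl(\sqrt{2 K_j \Delta} - 2\bigr)
         &= 2\sqrt{2 K_j \Delta} + 2 + \frac{4 K_j \Delta}{\sqrt{2 K_j \Delta}}
          = 4\sqrt{2 K_j \Delta} + 2 .
\end{align*}
```

Because the square root is concave, Jensen's inequality gives $E[\sqrt{K_j}] \le \sqrt{E[K_j]} = \sqrt{K}$, which is the step that bounds the expected total cost.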

VII. Experiments 

A. Conditions 

We tested the proposed method in terms of calculation cost in relation to search speed. We again note 
that the proposed search method guarantees the same search results as the TAS method in principle, and 
therefore we need to evaluate only the search speed. The search accuracy for the TAS method was reported in 
a previous paper [11]. In summary, for audio identification tasks, there were no false detections or false 
dismissals down to an S/N ratio of 20 dB if the query duration was longer than 10 seconds. 

In the experiments, we used a recording of a real TV broadcast. An audio signal broadcast from a 
particular TV station was recorded and encoded in MPEG-1 Layer 3 (MP3) format. We recorded a 200- 
hour audio signal as a stored signal, and recorded 200 15-second spots from another TV broadcast as 
queries. Thus, the task was to detect and locate specific commercial spots from 200 consecutive hours 
of TV recording. Each spot occurred 2-30 times in the stored signal. Each signal was first digitized at a 
32 kHz sampling frequency and 16 bit quantization accuracy. The bit rate for the MP3 encoding was 56 
kbps. We extracted base features from each audio signal using a 7-channel bank of second-order IIR 
band-pass filters with Q = 10. The center frequencies of the filters were equally spaced on a log frequency 
scale. The base features were calculated every 10 milliseconds from a 60 millisecond window. The base feature 
vectors were quantized by using a VQ codebook with 128 codewords, and histograms were created 
based on the scheme of the TAS method. Therefore, the histogram dimension was 128. We implemented 
the feature sampling described in Section V-D, and the sampling duration was a = 50. The tests were 
carried out on a PC (Pentium 4 2.0 GHz). 
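
The quantization and histogram steps can be illustrated with a toy example. This is a sketch only: it uses a hypothetical 3-codeword, 2-dimensional codebook instead of the 128-codeword codebook and 7-channel features described above, and the function names `quantize` and `histogram` are introduced here for illustration.

```python
from typing import List, Sequence


def quantize(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the nearest codeword under squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])),
    )


def histogram(codes: Sequence[int], n_codewords: int) -> List[int]:
    """Count how often each codeword occurs among the frames of one window."""
    hist = [0] * n_codewords
    for k in codes:
        hist[k] += 1
    return hist


# Toy data: 3 codewords and 4 feature frames (the experiments use 128 codewords).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 0.1), (0.2, 0.8), (1.1, -0.1)]
codes = [quantize(f, codebook) for f in frames]
print(codes, histogram(codes, len(codebook)))  # -> [0, 1, 2, 1] [1, 2, 1]
```

With one base feature every 10 milliseconds, a 15-second query corresponds to 1500 frames, and each histogram has one bin per codeword, which is why the histogram dimension equals the codebook size.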

B. Search speed 

We first measured the CPU time and the number of matches in the search. The search time we measured 
in this test comprised only the CPU time in the search stage shown in Section IV. This means that the 
search time did not include the CPU time for any procedures in the preparation stage such as base 
feature extraction, histogram creation, or histogram dimension reduction for the stored signal. The search 
threshold was adjusted to $\theta = 85$ so that there were no false detections or false dismissals. We compared 
the following methods: 

(i) The TAS method (baseline). 

(ii) The proposed search method without the projection distance being embedded in the compressed 
features. 

(iii) The proposed search method. 

Fig. 10. Relationship between average segment duration and search speed measured by the CPU time in the search: (Horizontal axis) 

Fig. 11. Relationship between average segment duration and search speed measured by number of matches: (Horizontal axis) Average 
segment duration [200 hours - 1.2 minutes], which corresponds to the number of segments [1 - 10000], (Vertical axis) Number of matches. 

We first examined the relationships between the average segment duration (equivalent to the number of 
segments), the search time, and the number of matches. The following parameters were set for 
feature-dimension reduction: the contribution threshold was $\sigma = 0.9$, and the width of the shiftable 
range for dynamic segmentation was 500. 

Fig. 10 shows the relationship between the average segment duration and the search time, where the 
ratio of the search speed of the proposed method to that of the TAS method (conventional method in the 
figure) is called the speed-up factor. Also, Fig. 11 shows the relationship between the average segment 
duration and the number of matches. Although the proposed method only slightly increased the number 
of matches, it greatly reduced the search time. This is because it greatly reduced the calculation cost per 
match owing to feature-dimension reduction. For example, the proposed method reduced the search time 
to almost 1/12 when the segment duration was 1.2 minutes (i.e. the number of segments was 10000). 
As mentioned in Section V-D, the feature sampling technique also contributed to the acceleration of the 
search, and the effect is similar to histogram skipping. Considering the dimension reduction performance 
results described later, we found that the effect of feature sampling was greater than that of dimension 
reduction for large segment durations (i.e. a small number of segments). This is examined in detail in the next 
section. We also found that the proposed method reduced the search time and the number of matches 
when the distance bounding technique was incorporated, especially when there were a large number of 
segments. 

Fig. 12. Dimension reduction performance based on contribution rates: (Horizontal axis) Segment duration [200 hours - 1.2 minutes], which 
corresponds to the number of segments [1 - 10000], (Vertical axis) Average dimensionality of projected features per sample. 

Fig. 13. Dimension reduction performance of dynamic segmentation [Number of segments = 1000, contribution rate = 0.9]: (Horizontal 
axis) Width of shiftable range [1 - 5000], (Vertical axis) Proportion of the dimensionality derived from dynamic segmentation compared with 
that obtained in the initial state (i.e. equi-partitioning). 

VIII. Discussion 

The previous section described the experimental results solely in terms of search speed and the 
advantages of the proposed method compared with the previous method. This section provides further discussion 
of the advantages and shortcomings of the proposed method as well as additional experimental results. 

We first deal with the dimension reduction performance derived from the segment-based KL transform. 
We employed equi-partitioning to obtain segments, which means that we did not incorporate the dynamic 
segmentation technique. Fig. 12 shows the experimental result. The proposed method monotonically 
reduced the dimensions as the number of segments increased if the segment duration was shorter than 10 
hours (the number of segments M > 20). We can see that the proposed method reduced the dimensions, 
for example, to 1/25 of those of the original histograms when the contribution threshold was 0.90 and the segment 
duration was 1.2 minutes (the number of segments was 10000). The average dimensions did not decrease 
as the number of segments increased if the number of segments was relatively small. This is because we 
decided the number of subspace bases based on the contribution rates. 

Fig. 14. Amount of computation for dynamic segmentation [Number of segments = 1000, contribution rate = 0.9]: (Horizontal axis) Width 
of shiftable range [1 - 1000], (Vertical axis) Total number of PCA calculations needed to obtain the representation, where the horizontal line 
along with "Real-time processing" indicates that the computational time is almost the same as the duration of the stored signal. 



Next, we deal with the dimension reduction performance derived from the dynamic segmentation 
technique. The initial positions of the segment boundaries were set by equi-partitioning. The duration 
of segments obtained by equi-partitioning was 12 minutes (i.e. there were 1000 segments). Fig. 13 shows 
the result. The proposed method further reduced the feature dimensionality to 87.5% of its initial value, 
which is almost the same level of performance as when only the local search was utilized. We were unable 
to calculate the average dimensionality when using DP because of the substantial amount of calculation, 
as described later. When the shiftable range was relatively narrow, the dynamic segmentation performance 
was almost the same as that of DP. 

Here, we review the search speed performance shown in Fig. 10. It should be noted that three techniques 
in our proposed method contributed to speeding up the search, namely feature-dimension reduction, 
distance bounding and feature sampling. When the number of segments was relatively small, the speed-up 
factor was much larger than the ratio of the dimension of the compressed features to that of the original 
histograms, which can be seen in Figs. 10, 12 and 13. This implies that the feature sampling technique 
dominated the search performance in this case. On the other hand, when the number of segments was 
relatively large, the proposed search method did not greatly improve the search speed compared with the 
dimension reduction performance. This implies that the feature sampling technique degraded the search 
performance. In this case, the distance bounding technique mainly contributed to the improvement of the 
search performance, as seen in Fig. 10. 



Lastly, we discuss the amount of calculation necessary for dynamic segmentation. We again note that 
although dynamic segmentation can be executed prior to providing a query signal, the computational time 
must at least be shorter than the duration of the stored signal from the viewpoint of practical applicability. 
We adopted the total number of dimension calculations needed to obtain the dimensions of the segments 
as a measure for comparing the calculation cost in the same way as in Section VI. Fig. 14 shows the 
estimated calculation cost for each dynamic segmentation method. We compared our method incorporating 
local optimization and coarse-to-fine detection with the DP-based method and a case where only the local 
optimization technique was incorporated. The horizontal line along with "Real-time processing" indicates 
that the computational time is almost the same as the duration of the signal. The proposed method required 
much less computation than DP or local optimization alone. For example, when the width of the shiftable 
range was 500, the calculation cost of the proposed method was 1/5000 that of DP and 1/10 that of 
local optimization. We note that in this experiment, the computational time of the proposed method is less 
than the duration of the stored signal, while those of the other two methods are much longer. 



IX. Concluding remarks 

This paper proposed a method for undertaking quick similarity-based searches of an audio signal to 
detect and locate segments similar to a given audio clip. The proposed method was built on the TAS method, 
where audio segments are modeled by using histograms. With the proposed method, the histograms are 
compressed based on a piecewise linear representation of histogram sequences. We introduced dynamic 
segmentation, which divides histogram sequences into segments of variable lengths. We also addressed the 
quick suboptimal partitioning of the histogram sequences along with local optimization and coarse-to-fine 
detection techniques. Experiments revealed significant improvements in search speed. For example, the 
proposed method reduced the total search time to approximately 1/12 that of the TAS method, and detected 
the query in about 0.3 seconds from a 200-hour audio database. Although this paper focused on audio signal 
retrieval, the proposed method can be easily applied to video signal retrieval [34], [35]. Although the method 
proposed in this paper is founded on the TAS method, we expect that some of the techniques we have described 
could be used in conjunction with other similarity-based search methods (e.g. [36], [37], [38], [39]) or a 
speech/music discriminator [40]. Future work includes the implementation of indexing methods suitable 
for piecewise linear representation, and the dynamic determination of the initial segmentation, both of 
which have the potential to improve the search performance further. 



Appendix A 
Proof of Theorem 1 



First, let us define 



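$$z_Q \stackrel{\mathrm{def}}{=} p(x_Q), \qquad z_S \stackrel{\mathrm{def}}{=} p(x_S),$$
$$\hat{x}_Q \stackrel{\mathrm{def}}{=} q(z_Q) = q(p(x_Q)), \qquad \hat{x}_S \stackrel{\mathrm{def}}{=} q(z_S) = q(p(x_S)),$$
$$\delta_Q \stackrel{\mathrm{def}}{=} \delta(p, x_Q), \qquad \delta_S \stackrel{\mathrm{def}}{=} \delta(p, x_S).$$

We note that for any histogram $x \in \mathbb{R}^n$, $\hat{x} = q(p(x))$ is the projection of $x$ onto the subspace defined 
by the map $p(\cdot)$, and therefore $x - \hat{x}$ is a normal vector of the subspace of $p(\cdot)$. Also, we note that 
$\|x - \hat{x}\| = \delta(p, x)$ and $\hat{x}$ is on the subspace of $p(\cdot)$. For two vectors $x_1$ and $x_2$, their inner product is 
denoted as $x_1 \cdot x_2$. Then, we obtain 

\begin{align}
\|x_Q - x_S\|^2
&= \|(x_Q - \hat{x}_Q) - (x_S - \hat{x}_S) + (\hat{x}_Q - \hat{x}_S)\|^2 \notag \\
&= \|x_Q - \hat{x}_Q\|^2 + \|x_S - \hat{x}_S\|^2 + \|\hat{x}_Q - \hat{x}_S\|^2 \notag \\
&\quad - 2(x_Q - \hat{x}_Q) \cdot (x_S - \hat{x}_S) + 2(x_Q - \hat{x}_Q) \cdot (\hat{x}_Q - \hat{x}_S) \notag \\
&\quad - 2(x_S - \hat{x}_S) \cdot (\hat{x}_Q - \hat{x}_S) \notag \\
&= \delta(p, x_Q)^2 + \delta(p, x_S)^2 + \|\hat{x}_Q - \hat{x}_S\|^2 - 2(x_Q - \hat{x}_Q) \cdot (x_S - \hat{x}_S) \tag{10} \\
&\ge \delta(p, x_Q)^2 + \delta(p, x_S)^2 + \|\hat{x}_Q - \hat{x}_S\|^2 - 2\,\delta(p, x_Q)\,\delta(p, x_S) \tag{11} \\
&= \{\delta(p, x_Q) - \delta(p, x_S)\}^2 + \|z_Q - z_S\|^2 \notag \\
&= \|y_Q - y_S\|^2, \notag
\end{align}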

where Eq. (10) comes from the fact that any vector on a subspace and the normal vector of the subspace 
are mutually orthogonal, and Eq. (11) from the definition of inner product. This concludes the proof of 
Theorem 1. 
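
The bound can also be checked numerically. The sketch below assumes that $p(\cdot)$ is an affine projection onto an orthonormal basis and that the compressed feature is the projected feature with the projection distance appended; the names `make_maps` and `compressed`, the basis, and the mean vector are hypothetical choices made for the example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_maps(basis, mean):
    """Return (p, q): projection onto an orthonormal basis and its back-projection."""
    def p(x):
        d = [xi - mi for xi, mi in zip(x, mean)]
        return [dot(b, d) for b in basis]

    def q(z):
        return [m + sum(zi * b[i] for zi, b in zip(z, basis)) for i, m in enumerate(mean)]

    return p, q


def compressed(x, p, q):
    """Projected feature z with the projection distance delta appended."""
    z = p(x)
    delta = math.dist(x, q(z))
    return z + [delta]


# A 2-dimensional subspace of R^3 with an arbitrary (hypothetical) mean vector.
s = 1 / math.sqrt(2)
p, q = make_maps([[1.0, 0.0, 0.0], [0.0, s, s]], [0.2, 0.3, 0.5])
rng = random.Random(0)
for _ in range(1000):
    xq = [rng.random() for _ in range(3)]
    xs = [rng.random() for _ in range(3)]
    lower = math.dist(compressed(xq, p, q), compressed(xs, p, q))
    assert lower <= math.dist(xq, xs) + 1e-12  # the compressed distance never overshoots
```

Since the distance between compressed features never exceeds the original distance, pruning a candidate on the compressed distance cannot discard a true match, which is the property behind the guarantee of identical search results.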

Appendix B 
Proof of Theorem 2 

The notations used in the previous section are also employed here. When the projected features $z_Q$, 
$z_S$ and the projection distances 

$$\delta_Q \stackrel{\mathrm{def}}{=} \delta(p, x_Q), \qquad \delta_S \stackrel{\mathrm{def}}{=} \delta(p, x_S)$$

are given, we can obtain the distance between the original features as follows: 

$$\begin{aligned}
\|x_Q - x_S\|^2
&= \|z_Q - z_S\|^2 + \delta_Q^2 + \delta_S^2 - 2\,(x_Q - q(z_Q)) \cdot (x_S - q(z_S)) \\
&= \|z_Q - z_S\|^2 + \delta_Q^2 + \delta_S^2 - 2\,\delta_Q \delta_S \cos\phi,
\end{aligned} \tag{12}$$



where Eq. (12) is derived from Eq. (10) and $\phi$ is the angle between $x_Q - q(z_Q)$ and $x_S - q(z_S)$. From 
the assumption that random variables $X_S$ and $X_Q$ corresponding to original histograms $x_S$ and $x_Q$ are 
distributed independently and uniformly in the set $A$, the following equation is obtained: 

$$E\left[\|X_Q - X_S\|^2 - \|z_Q - z_S\|^2\right]
= \int_0^{\pi} \left(\delta_Q^2 + \delta_S^2 - 2\,\delta_Q \delta_S \cos\phi\right)
\frac{S_{n-m-1}(\delta_S \sin\phi)}{S_{n-m}(\delta_S)} \left|d(\delta_S \cos\phi)\right|, \tag{13}$$



where $S_k(R)$ represents the surface area of a $k$-dimensional hypersphere with radius $R$, and can be 
calculated as follows: 

$$S_k(R) = \frac{2\pi^{k/2}}{\Gamma(k/2)}\, R^{k-1}. \tag{14}$$



Substituting Eq. (14) into Eq. (13), we obtain 

$$E\left[\|X_Q - X_S\|^2 - \|z_Q - z_S\|^2\right]
= \frac{n - m - 1}{n - m}\left(\delta_Q^2 + \delta_S^2\right)
\approx \frac{n - m - 1}{n - m}\,\delta_Q^2,$$

where the last approximation comes from the fact that $\delta_Q \gg \delta_S$. Also, from Eq. (4) we have 

$$\|x_Q - x_S\|^2 - \|y_Q - y_S\|^2 = 2\,\delta_Q \delta_S (1 - \cos\phi).$$
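
The identity above follows in one line from Eq. (12). The short derivation below is a sketch that assumes, as in the last steps of Appendix A, that $\|y_Q - y_S\|^2 = \|z_Q - z_S\|^2 + (\delta_Q - \delta_S)^2$.

```latex
\begin{align*}
\|x_Q - x_S\|^2 - \|y_Q - y_S\|^2
&= \bigl(\|z_Q - z_S\|^2 + \delta_Q^2 + \delta_S^2 - 2\delta_Q\delta_S\cos\phi\bigr)
 - \bigl(\|z_Q - z_S\|^2 + (\delta_Q - \delta_S)^2\bigr) \\
&= -2\delta_Q\delta_S\cos\phi + 2\delta_Q\delta_S \\
&= 2\delta_Q\delta_S(1 - \cos\phi).
\end{align*}
```

The difference is never negative and vanishes only when $\phi = 0$ (or one of the projection distances is zero), that is, when the two residual vectors point in the same direction.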
Therefore, we derive the following equation in the same way: 

\Vq -Vs\\ 2 ] 
-8q5 s 



E\\\X Q -X S \ 

n — m — 1 



n — m 



< E \ \\X 



X: 